Introduction

Street trees, especially in urban areas, contribute to environmental sustainability and biodiversity. Their benefits include reducing pollutants and carbon emissions, supporting physical and mental well-being, preventing floods, and increasing property values. New York City (NYC), one of the world's largest cities, has been implementing various tree planting and protection programs with the help of volunteers and non-profit organizations.

For this project I used the dataset from the 2015 NYC Street Tree Census, collected by the NYC Department of Parks and Recreation. The data was collected from the five boroughs of NYC (Manhattan, Bronx, Queens, Staten Island, and Brooklyn). The following link contains the data dictionary and provides more information about the dataset.

Load Packages

In [33]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

Import data

In [2]:
df=pd.read_csv("2015_Street_Tree_Census.csv")
df.head()
Out[2]:
tree_id block_id created_at tree_dbh stump_diam curb_loc status health spc_latin spc_common ... boro_ct state latitude longitude x_sp y_sp council district census tract bin bbl
0 180683 348711 8/27/2015 3 0 OnCurb Alive Fair Acer rubrum red maple ... 4073900 New York 40.723092 -73.844215 1027431.148 202756.7687 29.0 739.0 4052307.0 4.022210e+09
1 200540 315986 9/3/2015 21 0 OnCurb Alive Fair Quercus palustris pin oak ... 4097300 New York 40.794111 -73.818679 1034455.701 228644.8374 19.0 973.0 4101931.0 4.044750e+09
2 204026 218365 9/5/2015 3 0 OnCurb Alive Good Gleditsia triacanthos var. inermis honeylocust ... 3044900 New York 40.717581 -73.936608 1001822.831 200716.8913 34.0 449.0 3338310.0 3.028870e+09
3 204337 217969 9/5/2015 10 0 OnCurb Alive Good Gleditsia triacanthos var. inermis honeylocust ... 3044900 New York 40.713537 -73.934456 1002420.358 199244.2531 34.0 449.0 3338342.0 3.029250e+09
4 189565 223043 8/30/2015 21 0 OnCurb Alive Good Tilia americana American linden ... 3016500 New York 40.666778 -73.975979 990913.775 182202.4260 39.0 165.0 3025654.0 3.010850e+09

5 rows × 45 columns

Total number of rows and columns

In [3]:
df.shape
Out[3]:
(683788, 45)

Dataset information

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683788 entries, 0 to 683787
Data columns (total 45 columns):
tree_id             683788 non-null int64
block_id            683788 non-null int64
created_at          683788 non-null object
tree_dbh            683788 non-null int64
stump_diam          683788 non-null int64
curb_loc            683788 non-null object
status              683788 non-null object
health              652172 non-null object
spc_latin           652169 non-null object
spc_common          652169 non-null object
steward             652173 non-null object
guards              652172 non-null object
sidewalk            652172 non-null object
user_type           683788 non-null object
problems            652124 non-null object
root_stone          683788 non-null object
root_grate          683788 non-null object
root_other          683788 non-null object
trunk_wire          683788 non-null object
trnk_light          683788 non-null object
trnk_other          683788 non-null object
brch_light          683788 non-null object
brch_shoe           683788 non-null object
brch_other          683788 non-null object
address             683788 non-null object
postcode            683788 non-null int64
zip_city            683788 non-null object
community board     683788 non-null int64
borocode            683788 non-null int64
borough             683788 non-null object
cncldist            683788 non-null int64
st_assem            683788 non-null int64
st_senate           683788 non-null int64
nta                 683788 non-null object
nta_name            683788 non-null object
boro_ct             683788 non-null int64
state               683788 non-null object
latitude            683788 non-null float64
longitude           683788 non-null float64
x_sp                683788 non-null float64
y_sp                683788 non-null float64
council district    677269 non-null float64
census tract        677269 non-null float64
bin                 674229 non-null float64
bbl                 674229 non-null float64
dtypes: float64(8), int64(11), object(26)
memory usage: 234.8+ MB

Data Preparation

Drop irrelevant columns

In [5]:
tree_df = df.drop(['block_id',
              'spc_latin',
              'user_type',
              'address',
              'community board',
              'cncldist',
              'st_assem',
              'st_senate',
              'boro_ct',
              'state',
              'council district',
              'census tract',
              'bin',
              'bbl',
              'zip_city',
              'created_at',
              'postcode',
              'steward',
              'curb_loc'], axis = 1)

tree_df.shape
Out[5]:
(683788, 26)
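
As an alternative to dropping columns after loading, pandas can skip unwanted columns at read time through the `usecols` parameter of `read_csv`. The following is a minimal sketch, assuming a tiny in-memory CSV whose rows are invented for illustration; only the column names mirror the census file.

```python
import io

import pandas as pd

# Invented sample rows; NOT taken from the census file.
csv_text = (
    "tree_id,block_id,status,address\n"
    "1,10,Alive,1 Main St\n"
    "2,11,Dead,2 Main St\n"
)

# Columns to keep; everything else is never loaded into memory.
wanted = ["tree_id", "status"]

small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(small.shape)
```

For a file with several hundred thousand rows, this kind of selective load may reduce memory use, although the saving depends on which columns are skipped.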

Null values verification within each column

In [6]:
null_data = tree_df.isnull().sum()
null_data[null_data > 0]
Out[6]:
health        31616
spc_common    31619
guards        31616
sidewalk      31616
problems      31664
dtype: int64

Count of trees by status

In [7]:
tree_df['status'].value_counts()
Out[7]:
Alive    652173
Stump     17654
Dead      13961
Name: status, dtype: int64

Trees that are alive with missing species name

In [8]:
tree_df[(tree_df.status == "Alive") & (tree_df.spc_common.isnull())]
Out[8]:
tree_id tree_dbh stump_diam status health spc_common guards sidewalk problems root_stone ... brch_shoe brch_other borocode borough nta nta_name latitude longitude x_sp y_sp
356613 562532 4 0 Alive Good NaN None NoDamage Stones Yes ... No No 4 Queens QN49 Whitestone 40.791332 -73.803610 1038630.469 227641.3712
427541 630814 11 0 Alive Poor NaN NaN Damage NaN No ... No No 4 Queens QN45 Douglas Manor-Douglaston-Little Neck 40.771945 -73.750414 1053380.635 220615.7964
431417 651014 40 0 Alive Good NaN None Damage Stones Yes ... No No 4 Queens QN53 Woodhaven 40.686902 -73.859411 1023240.372 189564.7945
608632 47941 5 0 Alive Good NaN None NoDamage None No ... No No 4 Queens QN21 Middle Village 40.723484 -73.880296 1017429.853 202884.0907
656960 150745 3 0 Alive Good NaN None Damage None No ... No No 2 Bronx BX44 Williamsbridge-Olinville 40.894521 -73.858255 1023438.408 265207.8056

5 rows × 26 columns

The dataset contains a total of 652,173 trees that are "Alive". The remaining 31,615 trees are either "Dead" or "Stump". As mentioned in the data dictionary, most fields were not collected for stumps and dead trees, which accounts for the null values. Species names ("spc_common") are also missing for five of the trees that are alive. For the purpose of this analysis, stumps, dead trees, and trees with a missing species name are excluded.

Filter data to trees that are alive and have a species name

In [9]:
# .copy() makes flt_data an independent frame, so the columns added later do not trigger SettingWithCopyWarning
flt_data = tree_df[(tree_df.status == "Alive") & (tree_df.spc_common.notnull())].copy()
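
One way to check that the null values are concentrated in dead trees and stumps is to count missing values per column within each status group. This is a sketch on a hypothetical five-row table (values invented), not the census data itself.

```python
import pandas as pd

# Hypothetical miniature of the census table (values invented for illustration).
toy = pd.DataFrame({
    "status": ["Alive", "Alive", "Dead", "Stump", "Alive"],
    "health": ["Good", "Fair", None, None, "Poor"],
    "spc_common": ["red maple", None, None, None, "pin oak"],
})

# Flag missing cells, then sum the flags separately for each status value.
missing_by_status = (
    toy.drop(columns="status").isnull().groupby(toy["status"]).sum()
)
print(missing_by_status)
```

In this toy table the only gap among living trees is one missing species name, which is the same pattern the analysis describes for the real data.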
In [10]:
flt_data.shape
Out[10]:
(652168, 26)

Exploratory Data Analysis

Unique number of species

In [11]:
print(f"Total number of species: {flt_data['spc_common'].nunique()}")
Total number of species: 132

Trees Distribution

Count of trees in each borough

In [12]:
plt.figure(figsize = (10,8))

ax=sns.countplot(x="borough", data=flt_data, order = flt_data['borough'].value_counts().index, palette="muted")

ax.set(xlabel="NYC Boroughs", ylabel = "Total number of trees", title="Tree count in each of the boroughs")

for p in ax.patches:
       ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.2, p.get_height()+50))

Among the five boroughs, Queens has the highest number of trees and Manhattan the lowest. Land area could be the primary reason Manhattan has comparatively fewer trees: Manhattan covers 22.8 sq miles, while Queens covers 108.1 sq miles.
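
The land-area argument can be made concrete by normalizing counts by area. The sketch below uses the two land areas quoted in the text; the helper `trees_per_sq_mile` and the count passed to it are hypothetical placeholders, not census figures.

```python
# Land areas in square miles, as quoted in the text.
land_area_sq_mi = {"Manhattan": 22.8, "Queens": 108.1}


def trees_per_sq_mile(tree_count, area_sq_mi):
    """Return tree density so boroughs of different sizes can be compared."""
    return tree_count / area_sq_mi


# How many times larger Queens is than Manhattan.
area_ratio = land_area_sq_mi["Queens"] / land_area_sq_mi["Manhattan"]
print(f"Queens has about {area_ratio:.1f}x the land area of Manhattan")

# Placeholder count (NOT a census figure), used only to show the call.
example_density = trees_per_sq_mile(1000, land_area_sq_mi["Manhattan"])
print(f"{example_density:.1f} trees per sq mile")
```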

Top 20 Common Species

In [13]:
plt.figure(figsize = (15,8))

ax=sns.countplot(y="spc_common", data=flt_data, palette="muted", order = flt_data['spc_common'].value_counts().iloc[:20].index)

ax.set(xlabel="Total number of trees", ylabel = "Species name", title="Distribution of Species")

total = len(flt_data['spc_common'])

for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))

London planetree (13.3%) is the predominant species among the 132, followed by honeylocust (9.9%).
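
The percentages annotated on the bars can also be obtained as a table with `value_counts(normalize=True)`. This is a sketch on an invented ten-tree series, so the resulting shares are illustrative only.

```python
import pandas as pd

# Invented sample of ten trees; the shares below are NOT the census shares.
species = pd.Series(
    ["London planetree"] * 4 + ["honeylocust"] * 3 + ["pin oak"] * 2 + ["ginkgo"],
    name="spc_common",
)

# Fraction of trees per species, converted to a percentage.
share = species.value_counts(normalize=True).mul(100).round(1)

# value_counts sorts in descending order, so head() gives the top species.
top2 = share.head(2)
print(top2)
```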

Common Species by each borough

In [14]:
g=flt_data.groupby("borough")
In [15]:
plt.subplots_adjust(top=3.0, bottom=0.5,left=0.9, right=3.0, hspace=0.5, wspace=0.35)

plt.subplot(321)
sns.countplot(y="spc_common", data=g.get_group("Manhattan"),  order = flt_data['spc_common'].value_counts().iloc[:10].index)
plt.xlabel("Count")
plt.ylabel("Species")
plt.title("Manhattan")

plt.subplot(322)
sns.countplot(y="spc_common", data=g.get_group("Queens"),  order = flt_data['spc_common'].value_counts().iloc[:10].index)
plt.xlabel("Count")
plt.ylabel("Species")
plt.title("Queens")

plt.subplot(323)
sns.countplot(y="spc_common", data=g.get_group("Brooklyn"),  order = flt_data['spc_common'].value_counts().iloc[:10].index)
plt.xlabel("Count")
plt.ylabel("Species")
plt.title("Brooklyn")

plt.subplot(324)
sns.countplot(y="spc_common", data=g.get_group("Bronx"),  order = flt_data['spc_common'].value_counts().iloc[:10].index)
plt.xlabel("Count")
plt.ylabel("Species")
plt.title("Bronx")

plt.subplot(325)
sns.countplot(y="spc_common", data=g.get_group("Staten Island"),  order = flt_data['spc_common'].value_counts().iloc[:10].index)
plt.xlabel("Count")
plt.ylabel("Species")
plt.title("Staten Island")
Out[15]:
Text(0.5, 1.0, 'Staten Island')

For these count plots I selected only the 10 most common species citywide and checked their counts in each borough. Honeylocust is widespread across the Bronx and Manhattan, whereas London planetree is predominant in Queens and Brooklyn. Callery pear appears to be more common in Staten Island. Ginkgo and Sophora are less common in Staten Island than in the other boroughs.
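
The same borough-by-species comparison can be read from a contingency table instead of five separate plots. The sketch below assumes a small invented table and uses `pd.crosstab` with `idxmax` to pick the most frequent species per borough.

```python
import pandas as pd

# Invented records; only the column names match the census data.
toy = pd.DataFrame({
    "borough": ["Queens", "Queens", "Queens", "Bronx", "Bronx", "Bronx"],
    "spc_common": [
        "London planetree", "London planetree", "honeylocust",
        "honeylocust", "honeylocust", "pin oak",
    ],
})

# Rows are boroughs, columns are species, cells are tree counts.
counts = pd.crosstab(toy["borough"], toy["spc_common"])

# Column label with the largest count in each row.
most_common = counts.idxmax(axis=1)
print(most_common)
```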

Neighbourhoods and tree density per square mile

In [16]:
import json
import geopandas as gpd
from area import area
import plotly.graph_objects as go
import plotly.io as pio
In [17]:
with open("NYC_geodata.geojson") as f:  # the file is closed once it has been parsed
    nyc_data = json.load(f)
In [18]:
d = {}
neighborhood = nyc_data["features"]
for n in neighborhood:
    code = n["properties"]["ntacode"]
    a = area(n["geometry"])/(1609*1609) # converts from m^2 to mi^2
    d[code] = a
In [32]:
flt_data["area"] = flt_data["nta"].map(d)
flt_data = flt_data.dropna(subset=["area"])
flt_data['count_trees'] = flt_data.groupby('nta')['nta'].transform('count')
flt_data["density"] = flt_data["count_trees"]/flt_data["area"]
In [20]:
flt_data.head()
Out[20]:
tree_id tree_dbh stump_diam status health spc_common guards sidewalk problems root_stone ... borough nta nta_name latitude longitude x_sp y_sp area count_trees density
0 180683 3 0 Alive Fair red maple None NoDamage None No ... Queens QN17 Forest Hills 40.723092 -73.844215 1027431.148 202756.7687 2.077128 7330 3528.911193
1 200540 21 0 Alive Fair pin oak None Damage Stones Yes ... Queens QN49 Whitestone 40.794111 -73.818679 1034455.701 228644.8374 2.480408 7252 2923.712109
2 204026 3 0 Alive Good honeylocust None Damage None No ... Brooklyn BK90 East Williamsburg 40.717581 -73.936608 1001822.831 200716.8913 1.405719 2179 1550.096650
3 204337 10 0 Alive Good honeylocust None Damage Stones Yes ... Brooklyn BK90 East Williamsburg 40.713537 -73.934456 1002420.358 199244.2531 1.405719 2179 1550.096650
4 189565 21 0 Alive Good American linden None Damage Stones Yes ... Brooklyn BK37 Park Slope-Gowanus 40.666778 -73.975979 990913.775 182202.4260 1.527021 6097 3992.740464

5 rows × 29 columns

In [21]:
import math
import plotly.express as px
import plotly.offline as pyo
In [22]:
fig = px.choropleth_mapbox(flt_data,
                           geojson=nyc_data,
                           locations="nta",
                           featureidkey="properties.ntacode",
                           color="density",
                           color_continuous_scale="viridis",
                           mapbox_style="carto-positron",
                           zoom=9, center={"lat": 40.7, "lon": -73.9},
                           opacity=0.7,
                           hover_name="nta_name"
                           )


fig.show()
fig.write_html("myplot.html")

Upper East Side-Carnegie Hill (6298.716/sq mile), Central Harlem South (4975.432/sq mile), and Upper West Side (4634.076/sq mile) in Manhattan are among the five neighbourhoods with the highest tree density, along with Brooklyn Heights-Cobble Hill (4789.675/sq mile) in Brooklyn and Fordham South (4421.47/sq mile) in the Bronx. New Springville-Bloomfield-Travis (692.7113/sq mile) and Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill (698.0125/sq mile) in Staten Island have the lowest tree density.
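
Density is a per-neighbourhood quantity, so it can be computed on a table with one row per NTA code rather than one row per tree. The following sketch uses invented NTA codes and areas; the constant 1609.344 is the exact number of metres in an international mile, slightly more precise than the rounded 1609 used above.

```python
import pandas as pd

SQ_M_PER_SQ_MI = 1609.344 ** 2  # square metres in one square mile

# Hypothetical neighbourhood areas in square metres (invented values).
area_m2 = {"AA01": 2 * SQ_M_PER_SQ_MI, "BB02": 0.5 * SQ_M_PER_SQ_MI}

# Hypothetical tree records: one row per tree, tagged with its NTA code.
trees = pd.DataFrame({"nta": ["AA01"] * 6 + ["BB02"] * 3})

# Collapse to one row per neighbourhood with its tree count.
per_nta = trees.groupby("nta").size().rename("count_trees").reset_index()

# Convert each area to square miles, then divide count by area.
per_nta["area"] = per_nta["nta"].map(area_m2) / SQ_M_PER_SQ_MI
per_nta["density"] = per_nta["count_trees"] / per_nta["area"]
print(per_nta)
```

A compact table like `per_nta` could be passed to the choropleth in place of the full per-tree frame, which would likely make the figure lighter to render and save.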

Tree Diameter Distribution

Tree diameter descriptive statistics

In [23]:
flt_data.tree_dbh.describe()
Out[23]:
count    652168.000000
mean         11.709478
std           8.634185
min           0.000000
25%           5.000000
50%          10.000000
75%          16.000000
max         425.000000
Name: tree_dbh, dtype: float64

Trees with diameter of 425 inches

In [24]:
print(flt_data.loc[flt_data['tree_dbh'] == 425, ['spc_common', 'borough', 'nta_name']])
           spc_common   borough             nta_name
2405  swamp white oak  Brooklyn  Crown Heights North

Diameter outlier detection and removal

In [25]:
plt.figure(figsize=(8,6))
ax = sns.boxplot(x=flt_data["tree_dbh"], palette="Set3")
plt.title('Tree Diameter Boxplot', fontsize=15)
plt.xlabel('Tree Diameter', fontsize=10)
plt.show()
In [26]:
cols = ['tree_dbh'] 

Q1 = flt_data[cols].quantile(0.25)
Q3 = flt_data[cols].quantile(0.75)
IQR = Q3 - Q1

dbh_df = flt_data[~((flt_data[cols] < (Q1 - 1.5 * IQR)) | (flt_data[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
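
The 1.5 × IQR rule applied above can be worked through by hand with the quartiles reported by `describe()` (25% = 5, 75% = 16). The helper `is_outlier` is a hypothetical name used only in this sketch.

```python
# Quartiles of tree_dbh as reported in the describe() output above.
q1, q3 = 5.0, 16.0

iqr = q3 - q1                 # interquartile range
lower_fence = q1 - 1.5 * iqr  # values below this are treated as outliers
upper_fence = q3 + 1.5 * iqr  # values above this are treated as outliers


def is_outlier(dbh):
    """Return True when a diameter falls outside the IQR fences."""
    return dbh < lower_fence or dbh > upper_fence


print(iqr, lower_fence, upper_fence)
print(is_outlier(425), is_outlier(32))
```

Because diameters are non-negative, only the upper fence has any effect here: trees wider than 32.5 inches are removed, including the 425-inch swamp white oak.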

Diameter histogram

In [27]:
plt.figure(figsize=(8,6))

lbins=[0,5,10,15,20,25,30,35,40]
n, bins, patches = plt.hist(dbh_df['tree_dbh'], bins=lbins, facecolor = '#2ab0ff', 
                            edgecolor='#169acf', linewidth=0.5,alpha=0.7)

for i in range(len(patches)):
    patches[i].set_facecolor(plt.cm.viridis(n[i]/max(n)))
    
patches[0].set_fc('red') # Set color
patches[0].set_alpha(1) # Set opacity

plt.title('Tree Diameter distribution', fontsize=15)
plt.xlabel('Tree Diameter', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()

More than half of the trees have a diameter of 10 inches or less, which suggests that many of the street trees are narrow and possibly young.
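
The share of trees at or below a given diameter, and the counts behind each histogram bar, can be computed without plotting. This sketch uses eight invented diameters together with the same bin edges as the histogram; `np.histogram` applies the same binning convention that `plt.hist` uses.

```python
import numpy as np

# Invented diameters in inches; NOT census values.
dbh = np.array([2, 4, 7, 9, 10, 12, 18, 25])

# Fraction of trees with a diameter of 10 inches or less.
share_le_10 = float((dbh <= 10).mean())

# Same bin edges as the histogram above; each bin is [left, right)
# except the last one, which also includes its right edge.
edges = [0, 5, 10, 15, 20, 25, 30, 35, 40]
counts, _ = np.histogram(dbh, bins=edges)

print(share_le_10)
print(counts)
```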

Mean tree diameter by borough

In [28]:
from numpy import mean

plt.figure(figsize=(8,6))
ax = sns.barplot(x="borough", y="tree_dbh", data=dbh_df, estimator=mean)
plt.xlabel("borough")
plt.ylabel("Average Tree Diameter")
plt.title("Tree Diameter-Boroughs")
Out[28]:
Text(0.5, 1.0, 'Tree Diameter-Boroughs')

Mean diameter of honeylocust tree

In [29]:
(dbh_df['tree_dbh'][dbh_df['spc_common'] == "honeylocust"]).mean()
Out[29]:
10.177760473446504

Honeylocust is widespread in Manhattan, and its mean diameter is 10.18 inches, which indicates that these trees are fairly narrow. This may partly explain why the average tree diameter in Manhattan is relatively low compared to the other boroughs.
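
To test whether one species pulls a borough's average down, the species mean can be compared with the overall mean borough by borough. The sketch below assumes a four-row invented table; the numbers carry no meaning beyond showing the two `groupby` calls.

```python
import pandas as pd

# Invented records; diameters are NOT census values.
toy = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Queens", "Queens"],
    "spc_common": ["honeylocust", "ginkgo", "honeylocust", "pin oak"],
    "tree_dbh": [8, 6, 12, 20],
})

# Mean diameter of all trees in each borough.
mean_by_borough = toy.groupby("borough")["tree_dbh"].mean()

# Mean diameter of honeylocust only, per borough.
is_honeylocust = toy["spc_common"] == "honeylocust"
honeylocust_by_borough = toy[is_honeylocust].groupby("borough")["tree_dbh"].mean()

print(mean_by_borough)
print(honeylocust_by_borough)
```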

Tree health

In [30]:
x,y = 'borough', 'health'

df1 = flt_data.groupby(x)[y].value_counts(normalize=True)
df1 = df1.mul(100)
df1 = df1.rename('percent').reset_index()

g = sns.catplot(x=x, y='percent',hue=y,kind='bar',data=df1, palette="muted", height=6, aspect=1.0)
g.ax.set_ylim(0,100)

for p in g.ax.patches:
    txt = str(p.get_height().round(2)) + '%'
    txt_x = p.get_x() 
    txt_y = p.get_height()
    g.ax.text(txt_x,txt_y,txt)

The Bronx has the highest percentage of trees in good health, followed by Queens and Staten Island.
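
The per-borough health percentages can also be laid out as a table with `pd.crosstab(..., normalize="index")`, which makes each row sum to 100%. This is a sketch on invented records, so the percentages are illustrative only.

```python
import pandas as pd

# Invented records; only the column names match the census data.
toy = pd.DataFrame({
    "borough": ["Bronx"] * 4 + ["Queens"] * 4,
    "health": ["Good", "Good", "Good", "Fair", "Good", "Good", "Fair", "Poor"],
})

# normalize="index" divides each row by its total, giving within-borough shares.
pct = pd.crosstab(toy["borough"], toy["health"], normalize="index").mul(100)

# Borough with the largest share of trees rated "Good".
best = pct["Good"].idxmax()
print(pct)
print(best)
```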

In [31]:
problem_data = flt_data[['root_stone', 'root_grate', 'root_other', 
                         'trunk_wire', 'trnk_light','trnk_other',
                         'brch_light','brch_shoe', 'brch_other']]

problem_data.apply(pd.Series.value_counts)
Out[31]:
root_stone root_grate root_other trunk_wire trnk_light trnk_other brch_light brch_shoe brch_other
No 512171 648632 621846 638894 651137 619595 589803 651757 627813
Yes 139997 3536 30322 13274 1031 32573 62365 411 24355

Root problems caused by paving stones (root_stone, 139,997 trees) are the most common issue, followed by branch problems caused by lights or wires (brch_light, 62,365 trees).

Further analysis could examine the effect of street trees on air pollution and the housing market, treatment plans for damaged trees, and the allocation of space for new trees.
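
The "Yes" counts in the table above can be ranked directly to see which problems are most frequent. The sketch below copies those counts by hand and expresses each as a share of the 652,168 trees in the filtered data.

```python
# "Yes" counts copied from the Out[31] table above.
yes_counts = {
    "root_stone": 139997, "root_grate": 3536, "root_other": 30322,
    "trunk_wire": 13274, "trnk_light": 1031, "trnk_other": 32573,
    "brch_light": 62365, "brch_shoe": 411, "brch_other": 24355,
}
total_trees = 652168  # rows in the filtered data (Out[10])

# Sort problem flags from most to least frequent.
ranked = sorted(yes_counts.items(), key=lambda kv: kv[1], reverse=True)

for name, n in ranked:
    print(f"{name:<11}{n:>8}{100 * n / total_trees:7.1f}%")
```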

Code References:

[1] T. Hikaru Clark, New to Data Visualization? Start with New York City (2020), Medium.

[2] datavizpyr, How To Annotate Bars in Barplot with Matplotlib in Python? (May 29, 2020).

[3] A. Turin, How To Make Your Histogram Shine (2018), Medium.